Generalized Distributive Law

The generalized distributive law (GDL) is a generalization of the distributive property which gives rise to a general message passing algorithm. It is a synthesis of the work of many authors in the information theory, digital communications, signal processing, statistics, and artificial intelligence communities. The law and algorithm were introduced in a semi-tutorial by Srinivas M. Aji and Robert J. McEliece with the same title.


Introduction

''"The distributive law in mathematics is the law relating the operations of multiplication and addition, stated symbolically, a*(b + c) = a*b + a*c; that is, the monomial factor a is distributed, or separately applied, to each term of the binomial factor b + c, resulting in the product a*b + a*c"'' - Britannica

As the definition shows, applying the distributive law to an arithmetic expression reduces the number of operations in it. In the example above, the total number of operations drops from three (two multiplications and an addition in a*b + a*c) to two (one multiplication and one addition in a*(b + c)). Generalizing the distributive law leads to a large family of fast algorithms, including the FFT and the Viterbi algorithm.

This is explained more formally in the following example:
: \alpha(a,\, b) \stackrel{\mathrm{def}}{=} \sum_{c,d,e \in A} f(a,\, c,\, b) \, g(a,\, d,\, e)
where f(\cdot) and g(\cdot) are real-valued functions, a,b,c,d,e \in A and |A| = q (say).

Here we are "marginalizing out" the independent variables (c, d, and e) to obtain the result. To count the operations, note that for each of the q^2 pairs (a,b) there are q^3 terms, one per triplet (c,d,e), that take part in the evaluation of \alpha(a,\, b), and each term costs one addition and one multiplication. The total number of computations is therefore 2 \cdot q^2 \cdot q^3 = 2q^5, so the asymptotic complexity of evaluating the function this way is O(q^5).

If we apply the distributive law to the right-hand side of the equation, we get:
: \alpha(a,\, b) \stackrel{\mathrm{def}}{=} \sum_{c \in A} f(a,\, c,\, b) \cdot \sum_{d,e \in A} g(a,\, d,\, e)
This implies that \alpha(a,\, b) can be written as a product \alpha_1(a,\, b) \cdot \alpha_2(a), where
: \alpha_1(a,\, b) \stackrel{\mathrm{def}}{=} \sum_{c \in A} f(a,\, c,\, b)
and
: \alpha_2(a) \stackrel{\mathrm{def}}{=} \sum_{d,e \in A} g(a,\, d,\, e)

Counting operations again, there are q^3 additions in each of \alpha_1(a,\, b) and \alpha_2(a), and q^2 multiplications when the product \alpha_1(a,\, b) \cdot \alpha_2(a) is used to evaluate \alpha(a,\, b). The total number of computations is therefore q^3 + q^3 + q^2 = 2q^3 + q^2, so the asymptotic complexity of calculating \alpha(a,\, b) falls from O(q^5) to O(q^3). This example shows that applying the distributive law reduces the computational complexity, which is one of the desirable features of a "fast algorithm".
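The saving in this example can be checked numerically. The sketch below evaluates \alpha(a, b) both ways and compares the two tables; the kernels f and g and the alphabet size q = 4 are hypothetical choices made only for illustration, and integer values are used so that the comparison is exact.

```python
from itertools import product

def alpha_naive(f, g, A):
    """Direct evaluation: sum f(a,c,b) * g(a,d,e) over every triple (c,d,e)."""
    return {
        (a, b): sum(f(a, c, b) * g(a, d, e) for c, d, e in product(A, repeat=3))
        for a, b in product(A, repeat=2)
    }

def alpha_factored(f, g, A):
    """Factored evaluation: alpha(a,b) = alpha_1(a,b) * alpha_2(a)."""
    alpha_1 = {(a, b): sum(f(a, c, b) for c in A) for a, b in product(A, repeat=2)}
    alpha_2 = {a: sum(g(a, d, e) for d, e in product(A, repeat=2)) for a in A}
    return {(a, b): alpha_1[a, b] * alpha_2[a] for a, b in product(A, repeat=2)}

# Hypothetical kernels and alphabet (q = 4), integer-valued so equality is exact.
A = range(4)
f = lambda a, c, b: a + 2 * c + 3 * b + 1
g = lambda a, d, e: a * d - e + 2

naive = alpha_naive(f, g, A)
factored = alpha_factored(f, g, A)
print(naive == factored)
```

The naive version touches q^5 terms while the factored version touches on the order of q^3, which is the reduction described above.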


History

Some of the problems that have been solved using the distributive law can be grouped as follows.

1. Decoding algorithms
A GDL-like algorithm was used by Gallager for decoding low-density parity-check codes. Based on Gallager's work, Tanner introduced the Tanner graph and expressed Gallager's work in message passing form. The Tanner graph also helped explain the Viterbi algorithm. Forney observed that Viterbi's maximum likelihood decoding of convolutional codes also used algorithms of GDL-like generality.

2. Forward-backward algorithm
The forward-backward algorithm served as an algorithm for tracking the states in a Markov chain, and it too uses an algorithm of GDL-like generality.

3. Artificial intelligence
The notion of junction trees has been used to solve many problems in AI. The concept of bucket elimination also uses many of the same ideas.


The MPF problem

MPF, or marginalize a product function, is a general computational problem which includes as special cases many classical problems, such as computation of the discrete Hadamard transform, maximum likelihood decoding of a linear code over a memory-less channel, and matrix chain multiplication. The power of the GDL lies in the fact that it applies to situations in which additions and multiplications are generalized. A commutative semiring is a good framework for explaining this behavior. It is defined over a set K with operators "+" and "." where (K,\, +) and (K,\, .) are commutative monoids and the distributive law holds.

Let p_1, \ldots, p_n be variables such that p_1 \in A_1, \ldots, p_n \in A_n, where each A_i is a finite set and |A_i| = q_i for i = 1, \ldots, n. If S = \{i_1, \ldots, i_r\} and S \subset \{1, \ldots, n\}, let
: A_S = A_{i_1} \times \cdots \times A_{i_r},
: p_S = (p_{i_1}, \ldots, p_{i_r}),
: q_S = |A_S|,
: \mathbf A = A_1 \times \cdots \times A_n, and
: \mathbf p = \{p_1, \ldots, p_n\}.

Let S = \{S_j\}_{j=1}^M where S_j \subset \{1, \ldots, n\}. Suppose a function is defined as \alpha_i : A_{S_i} \rightarrow R, where R is a commutative semiring. The p_{S_i} are called the ''local domains'' and the \alpha_i the ''local kernels''.

The global kernel \beta : \mathbf A \rightarrow R is defined as
: \beta(p_1, \ldots, p_n) = \prod_{i=1}^M \alpha_i(p_{S_i})

''Definition of MPF problem'': For one or more indices i = 1, \ldots, M, compute a table of the values of the S_i-''marginalization'' of the global kernel \beta, which is the function \beta_i : A_{S_i} \rightarrow R defined as
: \beta_i(p_{S_i}) = \sum_{p_{S_i^c} \in A_{S_i^c}} \beta(p)

Here S_i^c is the complement of S_i with respect to \{1, \ldots, n\}, and \beta_i(p_{S_i}) is called the i^{th} ''objective function'', or the ''objective function'' at S_i. Computing the i^{th} objective function in the obvious way needs M q_1 q_2 \cdots q_n operations, because q_1 q_2 \cdots q_n additions and (M-1) q_1 q_2 \cdots q_n multiplications are required. The GDL algorithm, explained in the next section, can reduce this computational complexity.

The following is an example of the MPF problem. Let p_1, p_2, p_3, p_4, and p_5 be variables such that p_1 \in A_1, p_2 \in A_2, p_3 \in A_3, p_4 \in A_4, and p_5 \in A_5. Here M = 4 and S = \{\{1,2,5\}, \{2,4\}, \{1,4\}, \{2\}\}.
The given functions using these variables are f(p_1, p_2, p_5) and g(p_2, p_4), and we need to calculate \alpha(p_1,\, p_4) and \beta(p_2) defined as:
: \alpha(p_1,\, p_4) = \sum_{p_2 \in A_2,\, p_3 \in A_3,\, p_5 \in A_5} f(p_1,\, p_2,\, p_5) \cdot g(p_2,\, p_4)
: \beta(p_2) = \sum_{p_1 \in A_1,\, p_3 \in A_3,\, p_4 \in A_4,\, p_5 \in A_5} f(p_1,\, p_2,\, p_5) \cdot g(p_2,\, p_4)

Here the local domains and local kernels are defined as follows:

local domain              local kernel
\{p_1, p_2, p_5\}         f(p_1, p_2, p_5)
\{p_2, p_4\}              g(p_2, p_4)
\{p_1, p_4\}              1
\{p_2\}                   1

where \alpha(p_1, p_4) is the 3^{rd} objective function and \beta(p_2) is the 4^{th} objective function.

Consider another example where p_1, p_2, p_3, p_4, r_1, r_2, r_3, r_4 \in \{0, 1\} and f(r_1, r_2, r_3, r_4) is a real-valued function. We consider the MPF problem where the commutative semiring is the set of real numbers with ordinary addition and multiplication, and the local domains and local kernels are defined as follows:

local domain                  local kernel
\{r_1, r_2, r_3, r_4\}        f(r_1, r_2, r_3, r_4)
\{p_1, r_1\}                  (-1)^{p_1 r_1}
\{p_2, r_2\}                  (-1)^{p_2 r_2}
\{p_3, r_3\}                  (-1)^{p_3 r_3}
\{p_4, r_4\}                  (-1)^{p_4 r_4}
\{p_1, p_2, p_3, p_4\}        1

Since the global kernel is defined as the product of the local kernels, it is
: F(p_1, p_2, p_3, p_4, r_1, r_2, r_3, r_4) = f(r_1, r_2, r_3, r_4) \cdot (-1)^{p_1 r_1 + p_2 r_2 + p_3 r_3 + p_4 r_4}
and the objective function at the local domain \{p_1, p_2, p_3, p_4\} is
: F(p_1, p_2, p_3, p_4) = \sum_{r_1, r_2, r_3, r_4} f(r_1, r_2, r_3, r_4) \cdot (-1)^{p_1 r_1 + p_2 r_2 + p_3 r_3 + p_4 r_4}.
This is the Hadamard transform of the function f(\cdot). Hence the computation of the Hadamard transform is a special case of the MPF problem. Further examples show that many other classical problems, such as those listed above, are also special cases of the MPF problem.
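The Hadamard example can be sketched in code as a brute-force marginalization. The function below multiplies one kernel f(r) by the four sign kernels (-1)^{p_i r_i} and sums over all r, which is the "obvious way" of solving the MPF problem. The sample values of f are hypothetical, and the butterfly routine is a standard fast Walsh-Hadamard transform included only as a cross-check; it is not derived from the GDL here.

```python
from itertools import product

def hadamard_by_marginalization(f, n):
    """Brute-force MPF: sum the global kernel over all r for each p."""
    bits = list(product((0, 1), repeat=n))
    table = {}
    for p in bits:
        total = 0
        for r in bits:
            term = f[r]                          # local kernel on {r_1, ..., r_n}
            for p_i, r_i in zip(p, r):
                term *= (-1) ** (p_i * r_i)      # local kernel on {p_i, r_i}
            total += term
        table[p] = total
    return table

def fast_walsh_hadamard(values):
    """Standard unnormalized butterfly transform, used only as a cross-check."""
    v = list(values)
    h = 1
    while len(v) > h:
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                v[i], v[i + h] = v[i] + v[i + h], v[i] - v[i + h]
        h *= 2
    return v

n = 4
bits = list(product((0, 1), repeat=n))
f = {r: 3 * index - 7 for index, r in enumerate(bits)}   # hypothetical sample values

slow = hadamard_by_marginalization(f, n)
fast = fast_walsh_hadamard([f[r] for r in bits])
print([slow[p] for p in bits] == fast)
```

The brute-force version performs on the order of 2^n \cdot 2^n kernel evaluations, while the butterfly needs on the order of n \cdot 2^n additions, which illustrates the kind of saving the GDL is meant to capture.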


GDL: an algorithm for solving the MPF problem

If one can find a relationship among the elements of a given set S, then one can solve the MPF problem based on the notion of belief propagation, which is a special use of the "message passing" technique. The required relationship is that the given set of local domains can be organised into a junction tree. In other words, we create a graph-theoretic tree with the elements of S as the vertices of the tree
T, such that for any two vertices v_i and v_j with i \neq j, the intersection of the corresponding labels, S_i \cap S_j, is a subset of the label on each vertex on the unique path from v_i to v_j.

Example 1: Consider the following nine local domains:
# \{p_2\}
# \{p_3, p_2\}
# \{p_2, p_1\}
# \{p_3, p_4\}
# \{p_3\}
# \{p_1, p_4\}
# \{p_1\}
# \{p_4\}
# \{p_2, p_4\}
For the above set of local domains, one can organize them into a junction tree as shown below:

[Figure 1: junction tree for the local domains of Example 1]

Example 2: Consider the following four local domains:
# \{p_1, p_2\}
# \{p_2, p_3\}
# \{p_3, p_4\}
# \{p_1, p_4\}
Constructing the tree with only these local domains is not possible, since no domain in the set can be placed between any two others so as to satisfy the subset condition. However, if the two dummy domains shown below are added, the updated set can easily be organized into a junction tree.

5. \{p_1, p_2, p_4\}
6. \{p_2, p_3, p_4\}

For this extended set of domains, the junction tree looks as shown below:

[Figure: junction tree for the local domains of Example 2 with the two dummy domains]
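The path condition can be checked mechanically. The sketch below is a minimal illustration, assuming that the four domains of Example 2 are \{p_1,p_2\}, \{p_2,p_3\}, \{p_3,p_4\}, \{p_1,p_4\} and that the dummy domains are \{p_1,p_2,p_4\} and \{p_2,p_3,p_4\}. The edge lists are also assumptions, since the original figure is not available. The function `is_junction_tree` is a hypothetical helper that expects its edges to form a tree.

```python
from collections import deque
from itertools import combinations

def tree_path(adj, src, dst):
    """Return the unique path from src to dst in a tree (adjacency dict)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return path[::-1]

def is_junction_tree(labels, edges):
    """Check that S_i & S_j is contained in every label on the path v_i..v_j."""
    adj = {v: set() for v in labels}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    for i, j in combinations(labels, 2):
        common = labels[i] & labels[j]
        for k in tree_path(adj, i, j):
            if not common.issubset(labels[k]):
                return False
    return True

# Assumed domains of Example 2 (vertices 1-4) plus the dummy domains (5, 6).
labels = {
    1: {"p1", "p2"}, 2: {"p2", "p3"}, 3: {"p3", "p4"}, 4: {"p1", "p4"},
    5: {"p1", "p2", "p4"}, 6: {"p2", "p3", "p4"},
}
with_dummies = [(1, 5), (4, 5), (5, 6), (2, 6), (3, 6)]   # assumed tree shape
print(is_junction_tree(labels, with_dummies))

# A chain over the four original domains alone fails the condition.
original = {v: labels[v] for v in (1, 2, 3, 4)}
print(is_junction_tree(original, [(1, 2), (2, 3), (3, 4)]))
```

In the chain over the four original domains, vertices 1 and 4 share p_1 but the vertices between them do not contain it, which is why the dummy domains are needed.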


Generalized distributive law (GDL) algorithm

Input: A set of local domains.
Output: For the given set of domains, the minimum possible number of operations required to solve the problem is computed.

If v_i and v_j are connected by an edge in the junction tree, then a message from v_i to v_j is a table of values given by a function \mu_{i,j} : A_{S_i \cap S_j} \rightarrow R. Initially, for all combinations of i and j in the given tree, \mu_{i,j} is defined to be identically 1, and when a particular message is updated it follows the equation below.
: \mu_{i,j}(p_{S_i \cap S_j}) = \sum_{p_{S_i \setminus S_j} \in A_{S_i \setminus S_j}} \alpha_i(p_{S_i}) \prod_{v_k \operatorname{adj} v_i,\, k \neq j} \mu_{k,i}(p_{S_k \cap S_i}) \qquad (1)
where v_k \operatorname{adj} v_i means that v_k is an adjacent vertex to v_i in the tree.

Similarly, each vertex has a state, defined as a table containing the values of a function \sigma_i : A_{S_i} \rightarrow R. Just as messages are initialized to be identically 1, the state of v_i is initialized to the local kernel \alpha_i(p_{S_i}), and whenever \sigma_i is updated it follows the equation:
: \sigma_i(p_{S_i}) = \alpha_i(p_{S_i}) \prod_{v_k \operatorname{adj} v_i} \mu_{k,i}(p_{S_k \cap S_i}) \qquad (2)
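The message and state updates can be sketched as follows, assuming the ordinary (+, \times) semiring on integers. Everything named here is hypothetical and chosen for illustration: a three-vertex chain with labels \{a,b\}, \{b,c\}, \{c,d\}, the kernels, and the helpers `message`, `state` and `brute_force`. Messages are computed recursively from the neighbours rather than by an explicit schedule, which is enough to check the result against direct marginalization.

```python
from itertools import product
from math import prod

def assignments(variables, domains):
    """Yield every assignment (as a dict) of the given variables."""
    names = sorted(variables)
    for values in product(*(domains[x] for x in names)):
        yield dict(zip(names, values))

def key(assignment, variables):
    """Table key: values of `variables` in sorted-name order."""
    return tuple(assignment[x] for x in sorted(variables))

def message(i, j, labels, kernels, adj, domains):
    """Table mu_{i,j} over S_i & S_j, following the message update equation."""
    shared, summed = labels[i] & labels[j], labels[i] - labels[j]
    incoming = [(k, message(k, i, labels, kernels, adj, domains))
                for k in adj[i] if k != j]
    table = {}
    for outer in assignments(shared, domains):
        total = 0
        for inner in assignments(summed, domains):
            p = {**outer, **inner}                      # assignment on S_i
            term = kernels[i](p)
            for k, mu in incoming:
                term *= mu[key(p, labels[k] & labels[i])]
            total += term
        table[key(outer, shared)] = total
    return table

def state(i, labels, kernels, adj, domains):
    """Table sigma_i over S_i, following the state update equation."""
    incoming = [(k, message(k, i, labels, kernels, adj, domains)) for k in adj[i]]
    table = {}
    for p in assignments(labels[i], domains):
        value = kernels[i](p)
        for k, mu in incoming:
            value *= mu[key(p, labels[k] & labels[i])]
        table[key(p, labels[i])] = value
    return table

def brute_force(i, labels, kernels, domains):
    """Objective function at S_i by summing the global kernel directly."""
    table = {}
    for p in assignments(set(domains), domains):
        k = key(p, labels[i])
        table[k] = table.get(k, 0) + prod(kernels[v](p) for v in kernels)
    return table

# Hypothetical junction tree: a chain 1 - 2 - 3.
labels = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
adj = {1: [2], 2: [1, 3], 3: [2]}
domains = {x: range(3) for x in "abcd"}
kernels = {
    1: lambda p: p["a"] + 2 * p["b"] + 1,
    2: lambda p: p["b"] * p["c"] + 1,
    3: lambda p: p["c"] - p["d"] + 2,
}
print(state(2, labels, kernels, adj, domains) == brute_force(2, labels, kernels, domains))
```

Swapping the integer addition and multiplication for another commutative semiring (for example min and +) would follow the same pattern, with the initial values 0 and 1 replaced by the semiring identities.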


Basic working of the algorithm

Given a set of local domains as input, we first determine whether a junction tree can be created, either from the set directly or after adding dummy domains to it. If no junction tree can be constructed, the algorithm reports that there is no way to reduce the number of steps needed to compute the given problem. Once we have a junction tree, the algorithm schedules messages and computes states. Doing so shows where steps can be saved, as discussed below.


Scheduling of the message passing and the state computation

There are two special cases to discuss here: the ''Single Vertex Problem'', in which the objective function is computed at only one vertex v_0, and the ''All Vertices Problem'', where the goal is to compute the objective function at all vertices.

We begin with the single-vertex problem. GDL starts by directing each edge towards the target vertex v_0. Messages are sent only in the direction of the target vertex, and each directed message is sent only once. The messages start from the leaf nodes (those of degree 1) and travel up towards v_0: from the leaves to their parents, from there to their parents, and so on until they reach the target vertex. The target vertex v_0 computes its state only once it has received messages from all its neighbors. Once we have the state, we have the answer, and the algorithm terminates.

For example, consider the junction tree constructed from the set of local domains of Example 1 above.
The scheduling table for these domains, where the target vertex is v_1 with label \{p_2\}, is:

Round   Message or state computation
1. \mu_{8,4}(p_4) = \alpha_8(p_4)
2. \mu_{9,4}(p_4) = \sum_{p_2} \alpha_9(p_2, p_4)
3. \mu_{5,2}(p_3) = \alpha_5(p_3)
4. \mu_{6,3}(p_1) = \sum_{p_4} \alpha_6(p_1, p_4)
5. \mu_{7,3}(p_1) = \alpha_7(p_1)
6. \mu_{4,2}(p_3) = \sum_{p_4} \alpha_4(p_3, p_4) \cdot \mu_{8,4}(p_4) \cdot \mu_{9,4}(p_4)
7. \mu_{3,1}(p_2) = \sum_{p_1} \alpha_3(p_2, p_1) \cdot \mu_{6,3}(p_1) \cdot \mu_{7,3}(p_1)
8. \mu_{2,1}(p_2) = \sum_{p_3} \alpha_2(p_3, p_2) \cdot \mu_{4,2}(p_3) \cdot \mu_{5,2}(p_3)
9. \sigma_1(p_2) = \alpha_1(p_2) \cdot \mu_{2,1}(p_2) \cdot \mu_{3,1}(p_2)

Thus the complexity of single-vertex GDL is
: \sum_{v \in V} d(v) |A_{S(v)}| arithmetic operations,
where S(v) is the label of v and d(v) is the degree of v (i.e. the number of vertices adjacent to v). This expression is derived later in the article.

To solve the All-Vertices problem, GDL can be scheduled in several ways. One is a parallel implementation, in which every state is updated and every message is computed and transmitted at the same time in each round. In this type of implementation the states and messages stabilize after a number of rounds that is at most equal to the diameter of the tree. At this point the states of all the vertices are equal to the desired objective functions.

Another way to schedule GDL for this problem is a serial implementation, which is similar to the single-vertex problem except that the algorithm does not stop until every vertex of the required set has received the messages from all its neighbors and has computed its state. The number of arithmetic operations this implementation requires is at most
: \sum_{v \in V} d(v) |A_{S(v)}|.
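A single-vertex schedule of the kind described above can be generated by a depth-first traversal from the target vertex. The sketch below uses the tree edges listed in the edge-complexity calculation later in the article (1-2, 2-4, 2-5, 4-8, 4-9, 1-3, 3-7, 3-6). The function name `single_vertex_schedule` is hypothetical, and the order it produces is one valid schedule, not necessarily the one in the table.

```python
def single_vertex_schedule(edges, target):
    """Return directed messages (sender, receiver), leaves first, target last.

    A vertex sends its message towards the target only after all messages
    from its own subtree have been scheduled.
    """
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)

    order = []

    def visit(v, parent):
        for child in adj[v]:
            if child != parent:
                visit(child, v)            # schedule everything below the child
                order.append((child, v))   # then the child can send to v

    visit(target, None)
    return order

# Tree edges taken from the edge-complexity list in the article.
edges = [(1, 2), (2, 4), (2, 5), (4, 8), (4, 9), (1, 3), (3, 7), (3, 6)]
schedule = single_vertex_schedule(edges, target=1)
for round_number, (sender, receiver) in enumerate(schedule, start=1):
    print(round_number, f"mu_{{{sender},{receiver}}}")
```

Each of the eight edges carries exactly one message, after which the target vertex can compute its state in a ninth step.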


Constructing a junction tree

The key to constructing a junction tree lies in the local domain graph G_{LD}, which is a weighted complete graph with M vertices v_1, v_2, v_3, \ldots, v_M, i.e. one for each local domain, with the weight of the edge e_{i,j} : v_i \leftrightarrow v_j defined by
: \omega_{i,j} = |S_i \cap S_j|.
If x_k \in S_i \cap S_j, then we say x_k is contained in e_{i,j}. Let \omega_{max} denote the weight of a maximal-weight spanning tree of G_{LD}, and define
: \omega^* = \sum_{i=1}^M |S_i| - n
where ''n'' is the number of variables. A junction tree exists if and only if \omega_{max} = \omega^*, in which case any maximal-weight spanning tree of G_{LD} is a junction tree. For more clarity and details, see The Junction Tree Algorithm: http://www-anw.cs.umass.edu/~cs691t/SS02/lectures/week7.PDF
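The construction can be sketched with Kruskal's algorithm on the local domain graph, followed by a comparison of the spanning-tree weight with \sum |S_i| - n. This is an illustration under stated assumptions: the criterion used is that a junction tree exists exactly when the two numbers agree, the sample domains are the assumed four-cycle domains with two assumed dummy domains, and the function names are hypothetical.

```python
from itertools import combinations

def max_weight_spanning_tree(domains):
    """Kruskal's algorithm on the complete graph with weights |S_i & S_j|."""
    parent = list(range(len(domains)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    candidates = sorted(
        ((len(domains[i] & domains[j]), i, j)
         for i, j in combinations(range(len(domains)), 2)),
        reverse=True,
    )
    weight, tree = 0, []
    for w, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                # edge joins two components
            parent[root_i] = root_j
            weight += w
            tree.append((i, j))
    return weight, tree

def omega_star(domains):
    """Sum of the domain sizes minus the number of distinct variables."""
    n = len(set().union(*domains))
    return sum(len(s) for s in domains) - n

# Assumed example: four domains forming a cycle, then with two dummy domains.
cycle = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
extended = cycle + [{1, 2, 4}, {2, 3, 4}]

for domains in (cycle, extended):
    omega_max, tree = max_weight_spanning_tree(domains)
    print(omega_max, omega_star(domains), omega_max == omega_star(domains))
```

For the four-cycle the spanning-tree weight falls short of the target, while for the extended set the two agree, so the spanning tree returned for the extended set can serve as the junction tree.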


Scheduling theorem

Let 'T' be a junction tree with vertex set 'V' and edge set 'E'. In this algorithm, messages are sent in both directions on any edge, so we can regard the edge set E as a set of ordered pairs of vertices. For example, from Figure 1, 'E' can be defined as follows:
: E = \{(1,2),(2,1),(1,3),(3,1),(4,2),(2,4),(5,2),(2,5),(6,3),(3,6),(7,3),(3,7),(8,4),(4,8),(9,4),(4,9)\}
NOTE: E above gives all the possible directions in which a message can travel in the tree.

The schedule for the GDL is defined as a finite sequence of subsets of E, generally written as
: \mathcal{E} = \{E_1, E_2, E_3, \ldots, E_N\},
where E_t is the set of messages updated during the t^{th} round of running the algorithm.

With this notation in place, the theorem can be stated. Given a schedule \mathcal{E} = \{E_1, E_2, E_3, \ldots, E_N\}, define the corresponding message trellis as a finite directed graph with vertex set V \times \{0, 1, 2, 3, \ldots, N\}, in which a typical element is denoted by v_i(t) for t \in \{0, 1, 2, 3, \ldots, N\}. Then after completion of the message passing, the state at vertex v_j, computed as in
: \sigma_i(p_{S_i}) = \alpha_i(p_{S_i}) \prod_{v_k \operatorname{adj} v_i} \mu_{k,i}(p_{S_k \cap S_i}),
will be the j^{th} objective function if and only if there is a path from v_i(0) to v_j(N) for every i.
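The path condition of the theorem can be tested with a short reachability check. The article does not spell out the edges of the message trellis, so the sketch below assumes the usual construction: v_i(t-1) is joined to v_i(t) for every i, and v_i(t-1) is joined to v_j(t) whenever the message (i, j) belongs to E_t. The function name `all_reach_target` and the two sample schedules are hypothetical.

```python
def all_reach_target(vertices, schedule, target):
    """True if every v_i(0) has a trellis path to target(N).

    Works backwards from the last round: `reach` holds the vertices that,
    at the current time step, can still reach the target at time N.
    """
    reach = {target}
    for round_edges in reversed(schedule):
        reach = reach | {i for (i, j) in round_edges if j in reach}
    return reach == set(vertices)

vertices = range(1, 10)

# One message per round, leaves first (assumed order for the tree of Figure 1).
leaf_to_root = [[(8, 4)], [(9, 4)], [(5, 2)], [(6, 3)], [(7, 3)],
                [(4, 2)], [(3, 1)], [(2, 1)]]
print(all_reach_target(vertices, leaf_to_root, target=1))

# The same messages in reverse order do not deliver everything to vertex 1.
root_to_leaf = list(reversed(leaf_to_root))
print(all_reach_target(vertices, root_to_leaf, target=1))
```

In the reversed schedule the messages from vertices 2 and 3 are sent before they have heard from their own neighbours, so only vertices 1, 2 and 3 have a path to the target.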


Computational complexity

Here we explain the complexity of solving the MPF problem in terms of the number of mathematical operations required. That is, we compare the number of operations required by the normal method (meaning methods that do not use message passing or junction trees, in short methods that do not use the concepts of the GDL) with the number of operations required using the generalized distributive law.

Example: Consider the simplest case, where we need to compute the expression ab+ac. Evaluating this expression naively requires two multiplications and one addition. Using the distributive law, the expression can be written as a(b+c), a simple optimization that reduces the number of operations to one addition and one multiplication.

As in this example, we apply the GDL to express equations in different forms so as to perform as few operations as possible.

As explained in the previous sections, we solve the problem using junction trees. The optimization obtained by the use of these trees is comparable to the optimization obtained by solving a semigroup problem on trees. For example, to find the minimum of a group of numbers, suppose we have a tree with all the elements at its leaves. We can then compare pairs of items in parallel and write the resulting minimum to the parent. When this process is propagated up the tree, the minimum of the group of elements is found at the root.

The following is the complexity of solving the junction tree using message passing. We rewrite the formulas used earlier in the following form.
This is the equation for a message to be sent from vertex ''v'' to ''w'' (the message equation):
: \mu_{v,w}(p_{v \cap w}) = \sum_{p_{v \setminus w} \in A_{S(v) \setminus S(w)}} \alpha_v(p_v) \prod_{u \operatorname{adj} v,\, u \neq w} \mu_{u,v}(p_{u \cap v})
Similarly, we rewrite the equation for calculating the state of vertex v as follows:
: \sigma_v(p_v) = \alpha_v(p_v) \prod_{u \operatorname{adj} v} \mu_{u,v}(p_{v \cap u})

We first analyze the single-vertex problem and assume the target vertex is v_0, so that each vertex v other than v_0 has exactly one edge directed towards v_0. Suppose we have an edge (v,w); we calculate the message using the message equation. Computing the message for one particular value of p_{v \cap w} requires
: q_{v \setminus w} - 1 additions and
: q_{v \setminus w}(d(v) - 1) multiplications.
(We write q_{v \setminus w} for |A_{S(v) \setminus S(w)}|.)

There are
: q_{v \cap w} \stackrel{\mathrm{def}}{=} |A_{S(v) \cap S(w)}|
possible values of p_{v \cap w}. Thus the entire message needs
: (q_{v \cap w})(q_{v \setminus w} - 1) = q_v - q_{v \cap w} additions and
: (q_{v \cap w}) \, q_{v \setminus w} \, (d(v) - 1) = (d(v) - 1) q_v multiplications.

The total number of arithmetic operations required to send messages towards v_0 along the edges of the tree is
: \sum_{v \neq v_0} (q_v - q_{v \cap w}) additions and
: \sum_{v \neq v_0} (d(v) - 1) q_v multiplications.

Once all the messages have been transmitted, the algorithm terminates with the computation of the state at v_0. The state computation requires d(v_0) q_{v_0} more multiplications. Thus the number of calculations required to compute the state is
: \sum_{v \neq v_0} (q_v - q_{v \cap w}) additions and
: \sum_{v \neq v_0} (d(v) - 1) q_v + d(v_0) q_{v_0} multiplications.

Thus the grand total of the number of calculations is
: \chi(T) = \sum_{v \in V} d(v) q_v - \sum_{e \in E} q_e \qquad (1)
where e = (v,w) is an edge and its size q_e is defined as q_{v \cap w}. The formula above gives an upper bound.

If we define the complexity of the edge e = (v,w) as
: \chi(e) = q_v + q_w - q_{v \cap w}
then (1) can be written as
: \chi(T) = \sum_{e \in E} \chi(e)

We now calculate the edge complexities for the problem defined in Figure 1:
: \chi(1,2) = q_2 + q_2 q_3 - q_2
: \chi(2,4) = q_3 q_4 + q_2 q_3 - q_3
: \chi(2,5) = q_3 + q_2 q_3 - q_3
: \chi(4,8) = q_4 + q_3 q_4 - q_4
: \chi(4,9) = q_2 q_4 + q_3 q_4 - q_4
: \chi(1,3) = q_2 + q_2 q_1 - q_2
: \chi(3,7) = q_1 + q_1 q_2 - q_1
: \chi(3,6) = q_1 q_4 + q_1 q_2 - q_1
The total complexity is 3 q_1 q_2 + 3 q_3 q_4 + 3 q_2 q_3 + q_2 q_4 + q_1 q_4 - q_1 - q_3 - q_4, which is considerably lower than that of the direct method. (Here by direct method we mean methods that do not use message passing. The time taken by the direct method is equivalent to calculating the message at each node plus the time to calculate the state of each node.)

Now we consider the all-vertex problem, where messages have to be sent in both directions and the state must be computed at every vertex.
This would take O(\sum_v d(v)^2 q_v) operations, but by precomputing we can reduce the number of multiplications per vertex to 3(d-2), where d is the degree of the vertex. For example, given a set (a_1, \ldots, a_d) of d numbers, it is possible to compute all d products of d-1 of the a_i with at most 3(d-2) multiplications rather than the obvious d(d-2). We do this by precomputing the quantities
: b_1 = a_1, \; b_2 = b_1 \cdot a_2 = a_1 \cdot a_2, \; \ldots, \; b_{d-1} = b_{d-2} \cdot a_{d-1} = a_1 a_2 \cdots a_{d-1}
and
: c_d = a_d, \; c_{d-1} = a_{d-1} c_d = a_{d-1} \cdot a_d, \; \ldots, \; c_2 = a_2 \cdot c_3 = a_2 a_3 \cdots a_d
which takes 2(d-2) multiplications. Then, if m_j denotes the product of all a_i except a_j, we have m_1 = c_2, m_2 = b_1 \cdot c_3, and so on, which needs another d-2 multiplications, making the total 3(d-2).

There is not much we can do about the construction of the junction tree, except that there may be many maximal-weight spanning trees, and we should choose the one with the least \chi(T). Sometimes this means adding a local domain to lower the junction tree complexity.

It may seem that GDL is correct only when the local domains can be expressed as a junction tree. But even in cases where there are cycles, after a number of iterations the messages will be approximately equal to the objective function. Experiments with the Gallager–Tanner–Wiberg algorithm for low-density parity-check codes support this claim.
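The precomputation trick for the d products of d-1 factors can be sketched as follows. The prefix products play the role of the b_k and the suffix products the role of the c_k in the description above. The function name and the multiplication counter are hypothetical additions used to confirm the 3(d-2) count.

```python
def products_excluding_each(a):
    """Return ([m_1, ..., m_d], multiplications), m_j = product of all a_i, i != j."""
    d = len(a)
    if d in (0, 1):
        raise ValueError("need at least two factors")
    mults = 0

    # prefix[k-1] = b_k = a_1 * ... * a_k for k = 1 .. d-1
    prefix = [a[0]]
    for k in range(1, d - 1):
        prefix.append(prefix[-1] * a[k])
        mults += 1

    # suffix[k-2] = c_k = a_k * ... * a_d for k = 2 .. d
    suffix = [a[-1]]
    for k in range(d - 2, 0, -1):
        suffix.append(suffix[-1] * a[k])
        mults += 1
    suffix.reverse()

    m = [suffix[0]]                                  # m_1 = c_2
    for j in range(2, d):                            # m_j = b_{j-1} * c_{j+1}
        m.append(prefix[j - 2] * suffix[j - 1])
        mults += 1
    m.append(prefix[d - 2])                          # m_d = b_{d-1}
    return m, mults

values = [2, 3, 5, 7]
products, count = products_excluding_each(values)
print(products, count, 3 * (len(values) - 2))
```

In the all-vertices setting, the a_i would be the incoming messages at a vertex of degree d, and each m_j is the product needed for the outgoing message to the j-th neighbour.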

